What do we learn about the data that is different from the previous plot?
What is easier and what is harder or impossible to learn from this arrangement?
Separate plots
# Make separate plots for males and females, focus on counts by category
ggplot(tb_ind, aes(x = year, y = count, fill = sex)) +
  geom_bar(stat = "identity") +
  scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) +
  facet_grid(sex ~ age_group) +
  theme_bw()
Make a pie
# How to make a pie instead of a barchart - not straightforward
ggplot(tb_ind, aes(x = year, y = count, fill = sex)) +
  geom_bar(stat = "identity") +
  facet_grid(sex ~ age_group) +
  scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) +
  coord_polar() +
  theme_bw()
# Step 1 to make the pie
ggplot(tb_ind, aes(x = 1, y = count, fill = factor(year))) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(sex ~ age_group) +
  scale_fill_viridis_d("", option = "inferno")
Pie chart
# Now we have a pie, note the mapping of variables
# and the modification to the coord_polar
ggplot(tb_ind, aes(x = 1, y = count, fill = factor(year))) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(sex ~ age_group) +
  scale_fill_viridis_d("", option = "inferno") +
  coord_polar(theta = "y")
TWO MINUTE CHALLENGE
What are the pros, and cons, of using the pie chart for this data?
Would it be better if the pies used age for the segments, and facetted by year (and sex)?
tb_inc_100k <- read_csv(here::here("data/TB_burden_countries_2025-07-22.csv")) |>
  filter(iso3 %in% c("USA", "AUS"))
ggplot(tb_inc_100k, aes(y = iso3, x = e_inc_100k)) +
  stat_gradientinterval(fill = "darkorange") +
  ylab("") +
  xlab("Inc per 100k") +
  theme_ggdist()
ggplot(tb_inc_100k, aes(y = iso3, x = e_inc_100k)) +
  stat_halfeye(side = "right") +
  geom_dots(side = "left", fill = "darkorange", color = "darkorange") +
  ylab("") +
  xlab("Inc per 100k") +
  theme_ggdist()
Mapping and data
Map thinning
Warping the map to display statistics
cartograms, hexagon tiling
Plots and statistical inference
Tidy data and random variables
Tidy data mirrors elementary statistics
Tabular form puts variables in columns and observations in rows
Not all tabular data is in this form
In this form, we can think about \(X_1 \sim N(0,1), ~~X_2 \sim \text{Exp}(1) ...\)
A statistic is a function of the values of items in a sample, e.g. for \(n\) iid random variates \(\bar{X}_1=\displaystyle\frac{1}{n}\displaystyle\sum_{i=1}^n X_{i1}\), \(s_1^2=\displaystyle\frac{1}{n-1}\displaystyle\sum_{i=1}^n(X_{i1}-\bar{X}_1)^2\)
We study the behaviour of the statistic over all possible samples of size \(n\).
The grammar of graphics is the mapping of (random) variables to graphical elements, making plots of data into statistics
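A minimal sketch of these ideas in base R, assuming simulated data. The column names `x1` and `x2` and the helper `one_sample_mean()` are hypothetical and only serve the illustration. It builds a small tidy table, computes the two statistics for the first column, and then looks at how the sample mean behaves over many samples of size \(n\).

```r
# Sketch: a tidy table whose columns are draws of random variables,
# and statistics computed as functions of one column.
set.seed(1)
n <- 50

# Each column is one variable, each row is one observation
d <- data.frame(
  x1 = rnorm(n, mean = 0, sd = 1),  # X_1 ~ N(0, 1)
  x2 = rexp(n, rate = 1)            # X_2 ~ Exp(1)
)

# Sample mean and sample variance of the first column
xbar1 <- sum(d$x1) / n
s2_1  <- sum((d$x1 - xbar1)^2) / (n - 1)

# Behaviour of the statistic over many samples of size n
# (hypothetical helper, draws a fresh N(0, 1) sample each time)
one_sample_mean <- function(n) mean(rnorm(n))
xbar_draws <- replicate(2000, one_sample_mean(n))

# The spread of the sample mean should be close to 1 / sqrt(n)
sd(xbar_draws)
```

The last line approximates the standard error of the mean by simulation, which is one way to "study the behaviour of the statistic over all possible samples" when the full set of samples cannot be enumerated.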
What is inference?
Inferring that what we see in the data at hand holds more broadly in life, society and the world.
Which plot definition would best match \(H_0:\) there is no difference in the distribution between the groups?
Some examples
Here are several null hypotheses.
What type of plot would you use to test each?
\(H_0:\) no association between x1 and x2
\(H_0:\) no difference between levels of cl
\(H_0:\) the distribution of x1 is XXX
\(H_0:\) no difference in the distribution of x1 between levels of cl
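One possible reading of the four hypotheses as plot definitions is sketched here, assuming `ggplot2` is installed. The data frame `d` is simulated purely for illustration, and the chosen geoms are suggestions rather than the only reasonable answers.

```r
library(ggplot2)

# Hypothetical data with the variable names used in the hypotheses
set.seed(2)
d <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  cl = sample(c("a", "b"), 100, replace = TRUE)
)

# H0: no association between x1 and x2 -> scatterplot
p_assoc <- ggplot(d, aes(x = x1, y = x2)) + geom_point()

# H0: no difference between levels of cl -> bar chart of counts
p_levels <- ggplot(d, aes(x = cl)) + geom_bar()

# H0: the distribution of x1 is XXX -> histogram (a QQ-plot is another option)
p_dist <- ggplot(d, aes(x = x1)) + geom_histogram(bins = 20)

# H0: no difference in the distribution of x1 between levels of cl
# -> side-by-side boxplots
p_groups <- ggplot(d, aes(x = cl, y = x1)) + geom_boxplot()
```

In each case the plot definition fixes the mapping of variables to graphical elements, and the null hypothesis describes what the plot would look like if there were no structure.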
Let’s do it
# Make a lineup of mtcars data
# 20 plots, one data, 19 nulls
# Which one is different?
set.seed(20190709)
library(ggplot2)
library(nullabor)
ggplot(lineup(null_permute('mpg'), mtcars), aes(mpg, wt)) +
  geom_point() +
  facet_wrap(~ .sample)
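To make the permutation null concrete, here is a base R sketch of the same idea. It is not the `nullabor` implementation. The names `m`, `panels` and `lineup_df` are hypothetical. Each null panel shuffles `mpg`, which keeps its marginal distribution but breaks any association with `wt`.

```r
# Sketch of a permutation lineup in base R (not the nullabor implementation)
set.seed(20190709)
m   <- 20               # number of panels in the lineup
pos <- sample(m, 1)     # hidden position of the data plot

panels <- lapply(seq_len(m), function(i) {
  d <- mtcars[, c("mpg", "wt")]
  # Null panels: permute mpg, breaking any association with wt
  if (i != pos) d$mpg <- sample(d$mpg)
  d$.sample <- i
  d
})
lineup_df <- do.call(rbind, panels)

# Every panel has the same set of mpg values;
# only the pairing with wt differs between data and null panels
table(lineup_df$.sample)
```

The resulting `lineup_df` could be passed to `ggplot()` with `facet_wrap(~ .sample)` in the same way as the output of `lineup()`.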
Lineup
Mix the data plot into a field of null plots
null hypothesis is no association between the two variables
null generating mechanism: residual rotation
# Assessing model fit, using a lineup of residual plots: 19 nulls + 1 resid plot
# Structure in the residual plot corresponding to less than random variation?
# Nulls are generated by `rotating` residuals after model fit.
tips <- read_csv("http://www.ggobi.org/book/data/tips.csv")
x <- lm(tip ~ totbill, data = tips)
tips.reg <- data.frame(tips, .resid = residuals(x), .fitted = fitted(x))
ggplot(lineup(null_lm(tip ~ totbill, method = 'rotate'), tips.reg)) +
  geom_point(aes(x = totbill, y = .resid)) +
  facet_wrap(~ .sample)
Goodness-of-fit & residuals
Let’s talk TB
Earlier:
Across all ages and years, the proportion of males having TB is higher than that of females.
These proportions tend to be higher in the middle age groups, for all years.
Relatively similar proportions occur across years.
Null hypothesis
Plot count against year, separately for each age group, coloured by sex.
Colouring by sex \(\Rightarrow\) primary comparison
Plot shows proportion of sex, given age group and year
\(H_0\): TB occurs equally among men and women, regardless of age and year.
\(H_A\): It doesn’t.
TB Lineup
# Make expanded rows of categorical variables matching the
# counts of aggregated data. Sex needs to be converted to 0, 1
# to match binomial output.
tb_us_long <- uncount(tb_us, count)
tb_us_long <- tb_us_long |>
  mutate(sex01 = ifelse(sex == "m", 0, 1)) |>
  select(-sex)

# Generate a lineup of n=3, randomly choose the data position.
# Compute counts again.
pos <- sample(1:3, 1)
l <- lineup(null_dist(var = "sex01", dist = "binom",
                      list(size = 1, p = 0.5)),
            true = tb_us_long, n = 3, pos = pos)
l <- l |>
  group_by(.sample, year, age) |>
  count(sex01)
TB Lineup
ggplot(l, aes(x = year, y = n, fill = factor(sex01))) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(.sample ~ age) +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "none")
TB Lineup
A more complicated null
\(H_0\): Rates are the same across sex, regardless of age and year. \(H_A\): They aren’t.
Create lineup, with null data sampled from a Binomial() distribution with the sample proportion as \(p\)
3
Compute aggregate results
# A tibble: 2 × 2
sex count
<chr> <dbl>
1 f 25915
2 m 55640
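A minimal sketch of how the sample proportion could feed the null distribution, using the two aggregate counts shown above. It assumes that \(p\) is the overall proportion of females (coded 1 in `sex01`, with males coded 0). The names `agg`, `p_hat` and `null_sex01` are hypothetical.

```r
# Aggregate counts by sex, copied from the tibble above
agg <- data.frame(sex = c("f", "m"), count = c(25915, 55640))

# Sample proportion of females (coded 1 in sex01; males are coded 0)
p_hat <- agg$count[agg$sex == "f"] / sum(agg$count)

# Draw null values of sex01 from Binomial(size = 1, prob = p_hat),
# one value per individual record
set.seed(3)
n_total    <- sum(agg$count)
null_sex01 <- rbinom(n_total, size = 1, prob = p_hat)

# The simulated proportion should sit close to the sample proportion
round(p_hat, 3)
mean(null_sex01)
```

Under this null the overall sex imbalance is preserved in every null plot, so any remaining difference between the data plot and the nulls points to variation across age or year.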
TB Lineup
ggplot(l, aes(x = year, y = n, fill = factor(sex01))) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(.sample ~ age) +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "none")
TB Lineup
Danger zone
\(H_0\) is determined based on the plot type
\(H_0\) is not based on the structure seen in the data set
Null data creation method does not match characteristics of original sample other than that in \(H_0\)
A map lineup example
Does one map show a spatial trend?
# Read xlsx spreadsheet on cancer incidence in USA, for a more
# complex lineup example, a lineup of maps
load("data/fifty_states.rda")
incd <- read_xlsx("data/IncRate.xlsx", skip = 6, sheet = 2) |>
  filter(!(State %in% c("All U.S. combined", "Kansas"))) |>
  select(State, `Melanoma of the skin / Both sexes combined`) |>
  rename(Incidence = `Melanoma of the skin / Both sexes combined`) |>
  mutate(Incidence = as.numeric(substr(Incidence, 1, 3)))

# State names need to coincide between data sets
incd <- incd |> mutate(State = tolower(State))

# Choose a position
pos <- 6

# Make lineup of cancer incidence
incd_lineup <- lineup(null_permute('Incidence'), incd, n = 18, pos = pos)

# Join cancer incidence data to map polygons
incd_map <- left_join(fifty_states, filter(incd_lineup, .sample == 1),
                      by = c("id" = "State"))
for (i in 2:18) {
  x <- left_join(fifty_states, filter(incd_lineup, .sample == i),
                 by = c("id" = "State"))
  incd_map <- bind_rows(incd_map, x)
}

# Remove Kansas - it was missing the cancer data
incd_map <- incd_map |> filter(!is.na(.sample))

# Plot the maps as a lineup
ggplot(incd_map) +
  geom_polygon(aes(x = long, y = lat, fill = Incidence, group = group)) +
  expand_limits(x = incd_map$long, y = incd_map$lat) +
  coord_map() +
  scale_x_continuous(breaks = NULL) +
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  scale_fill_viridis_b(option = "D") +
  theme(legend.position = "none", panel.background = element_blank()) +
  facet_wrap(~ .sample, ncol = 6)
# A tibble: 6 × 3
sex age count
<chr> <chr> <dbl>
1 m 15-24 17304
2 m 25-34 25460
3 m 35-44 23057
4 m 45-54 23751
5 m 55-64 20204
6 m 65+ 9554
In arrangement A, separate plots are made for age, and sex is mapped to the x axis.
Conversely, in arrangement B, separate plots are made for sex, and age is mapped to the x axis.
At which age(s) are the counts for males and females relatively the same?
Which plot makes this question easier to answer?
What do we learn from each plot? What is the focus of each? What is easy, and what is harder?
TWO MINUTE CHALLENGE
Write out a question that would be easier to answer from arrangement B.
Three Variables
Next, we have two different plots of TB incidence in Kenya, based on three variables:
tb_kn |>select(year, sex, age, count) |>head(10)
# A tibble: 10 × 4
year sex age count
<dbl> <chr> <chr> <dbl>
1 1995 m 15-24 203
2 1995 m 25-34 297
3 1995 m 35-44 306
4 1995 m 45-54 302
5 1995 m 55-64 228
6 1995 m 65+ 109
7 1995 f 15-24 160
8 1995 f 25-34 244
9 1995 f 35-44 282
10 1995 f 45-54 192
In plot type A, a line plot of counts is drawn separately by age and sex, and year is mapped to the x axis.
Conversely, in plot type B, counts are stacked into a bar chart, drawn separately by age and sex, with year mapped to the x axis.
Is the trend for females generally decreasing over time? Which type of plot makes this easier to answer?
TWO MINUTE CHALLENGE
What are the pros and cons of each way of displaying the same information? Should specific limits on axes be made?
Should the limits of the y axis in plot A include 0 (zero)?
TWO MINUTE CHALLENGE
Plot A shows the proportion as a line plot.
Plot B shows stacked bars scaled to 100% for females and males.
Is there an age effect in the proportion of incidence by gender? Is there a temporal trend in the proportions?
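Both plots rely on proportions computed from the counts. A base R sketch of that step follows, using the 1995 rows for which both sexes appear in the table printed earlier. The data frame name `tb` and the column `prop` are hypothetical, and the slides may compute the proportions differently.

```r
# Counts copied from the first rows of the table printed earlier (year 1995),
# restricted to the age groups where both sexes are listed
tb <- data.frame(
  year  = 1995,
  sex   = c("m", "m", "m", "m", "f", "f", "f", "f"),
  age   = rep(c("15-24", "25-34", "35-44", "45-54"), times = 2),
  count = c(203, 297, 306, 302, 160, 244, 282, 192)
)

# Total count per year and age group, aligned with the rows of tb
total <- ave(tb$count, tb$year, tb$age, FUN = sum)

# Proportion of each sex within its year and age group
tb$prop <- tb$count / total

# Proportion of males per age group
tb[tb$sex == "m", c("year", "age", "prop")]
```

Plotting `prop` for one sex against year gives the line plot in A, while plotting both sexes with bars filled to 100% gives the arrangement in B.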
Perceptual principles
Hierarchy of mappings
Pre-attentive: some elements are noticed before you even realise it.
Color palettes: qualitative, sequential, diverging.
Proximity: Place elements for primary comparison close together.
Change blindness: When focus is interrupted differences may not be noticed.
Hierarchy of mappings
Position - common scale (BEST)
Position - nonaligned scale
Length, direction, angle
Area
Volume, curvature
Shading, color (WORST)
(Cleveland, 1984; Heer and Bostock, 2009)
TWO MINUTE CHALLENGE
Come up with a plot type for each of the mappings.
Online checking tool coblis: upload an image and it will re-map the colors for different colour perception issues.
The package colorblind has color blind friendly palettes (Susan: but the colours are awful 😭).
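As one alternative, here is a sketch that uses the Okabe-Ito palette shipped with base R (`palette.colors()` in `grDevices`, available from R 4.0.0) instead of a separate package. This is a swapped-in option, not the package named above, and the toy data frame `d` is hypothetical.

```r
library(ggplot2)

# Okabe-Ito palette from base R; unname() so that scale_fill_manual()
# assigns colours by order rather than by colour name
okabe_ito <- unname(palette.colors(n = 8, palette = "Okabe-Ito"))

# Hypothetical toy data: counts for two groups
d <- data.frame(sex = c("f", "m"), count = c(10, 20))

# Skip the first colour (black) and use the next two for the groups
p <- ggplot(d, aes(x = sex, y = count, fill = sex)) +
  geom_col() +
  scale_fill_manual("Sex", values = okabe_ito[2:3])
```

Whichever palette is chosen, it is still worth checking the rendered plot with a simulation tool such as coblis.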
Color blind simulation
Original colours
Color blind view
Pre-attentive
Can you find the odd one out?
Pre-attentive
Is it easier now?
Proximity
Place elements that you want to compare close to each other. If there are multiple comparisons to make, you need to decide which one is most important.
Mapping and proximity
Same proximity is used, but different geoms.
Which is better to determine the relative ratios of males to females by age?
Mapping and proximity
Same proximity is used, but different geoms.
Which is better to determine the relative ratios of ages by sex?
The hierarchy matters if the structure is weak or differences between groups are small.
Knowing how to use proximity is a valuable and rare skill
Use of colour: don’t overuse
Too many colours
Mapping a continuous variable to colour to add another dimension
Core principles
Show the data!
Statistics are good if there’s too much data
Always plot the data for yourself to see the variability
One plot is never enough
Plot the data in different ways
Understand the relationships between variables
Your turn
This builds on the exercise from the previous session.
Using your choice of country, for example, Australia, make a set of plots to explore the TB incidence among males relative to females over different age groups for 2012.
Choose your best plot to answer this question: Is there a higher prevalence of TB among younger women in 2012?